fix(sandbox): refresh docker/podman/vm tokens in gateway by TaylorMutch · Pull Request #1639 · NVIDIA/OpenShell

TaylorMutch · 2026-05-29T19:50:55Z

Summary

Fix sandbox token recovery for singleplayer runtimes by making Docker, Podman, and VM token refresh gateway-managed, with an explicit supervisor auth mode contract for each backend. Kubernetes remains on ServiceAccount exchange instead of gateway-managed token files.

Related Issue

Closes #1603

Changes

Add explicit OPENSHELL_SANDBOX_AUTH_MODE values for static, gateway-managed file, gateway-managed supervisor-push, and Kubernetes ServiceAccount exchange paths
Refresh Docker, Podman, and VM sandbox tokens from the gateway during startup resume and periodic rotation
Add compute-driver resume/write-token RPCs and VM supervisor token push handling
Add Docker, Podman, and VM expired-token resume regression tests
Update architecture and driver docs for the explicit token ownership model

Testing

mise run pre-commit passes
Unit tests added/updated
E2E tests added/updated
Docker expired-token resume e2e passed
Podman expired-token resume e2e passed
VM expired-token resume e2e: attempted locally, blocked by missing mkfs.ext4/e2fsprogs prerequisite on this machine

Checklist

Follows Conventional Commits
Commits are signed off (DCO)

github-actions · 2026-05-29T19:51:06Z

Label test:e2e applied for df3f0bc. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

maxamillion · 2026-06-02T18:57:30Z

@TaylorMutch In the process of asking my coding agent questions to wrap my head around some of the implementation details here, I stumbled on the following potential issue where a silent failure can occur in the VM token rotation:

In crates/openshell-server/src/compute/mod.rs:887, the refresh sweep treats write_sandbox_token() success as a completed refresh, then optionally pushes the token over the supervisor stream. But send_sandbox_token_update() returns Ok(false) when no supervisor session is connected (crates/openshell-server/src/supervisor_session.rs:210), and the caller ignores that boolean. For VM, the driver-side write only updates the persisted sandbox request (crates/openshell-driver-vm/src/driver.rs:589); the running guest updates /opt/openshell/auth/sandbox.jwt only when it receives SandboxTokenUpdate (crates/openshell-sandbox/src/supervisor_session.rs:422). If a running VM’s supervisor stream is temporarily disconnected during refresh, the sweep records success while the guest keeps the old token. Once that old JWT expires, reconnects and outbound RPCs can keep failing because no connected session exists to receive the new token. Fix by treating Ok(false) as an undelivered refresh for supervisor-push runtimes and retrying, or by sending the latest minted token during VM supervisor session registration.

TaylorMutch · 2026-06-02T18:59:10Z

@TaylorMutch In the process of asking my coding agent questions to wrap my head around some of the implementation details here, I stumbled on the following potential issue where a silent failure can occur in the VM token rotation:

In crates/openshell-server/src/compute/mod.rs:887 (https://github.com/NVIDIA/OpenShell/pull/1639/files), the refresh sweep treats write_sandbox_token() success as a completed refresh, then optionally pushes the token over the supervisor stream. But send_sandbox_token_update() returns Ok(false) when no supervisor session is connected (crates/openshell-server/src/supervisor_session.rs:210), and the caller ignores that boolean. For VM, the driver-side write only updates the persisted sandbox request (crates/openshell-driver-vm/src/driver.rs:589); the running guest updates /opt/openshell/auth/sandbox.jwt only when it receives SandboxTokenUpdate (crates/openshell-sandbox/src/supervisor_session.rs:422). If a running VM’s supervisor stream is temporarily disconnected during refresh, the sweep records success while the guest keeps the old token. Once that old JWT expires, reconnects and outbound RPCs can keep failing because no connected session exists to receive the new token. Fix by treating Ok(false) as an undelivered refresh for supervisor-push runtimes and retrying, or by sending the latest minted token during VM supervisor session registration.

Yup, this is exactly where I've gotten hung up on this PR and haven't made much progress since I first posted it. The VM implementation in particular has been tricky, and I am definitely open to ideas here.

maxamillion · 2026-06-02T20:25:15Z

@TaylorMutch I think the main question I have is, is the expectation that the microvm ComputeDriver be a single-host single-player ComputeDriver option similar to podman and docker?

If so, we should be able to make certain assumptions that can make this easier to reason with. Effectively my pitch would be to toss the entire gateway managed file approach and instead use krun_add_virtiofs3 to expose the host OS directory to the VM as a read-only virtio-fs device. Then the refresh logic could be the same as it is for docker and podman. The is similar in approach to how podman --runtime krun volume bind mounts work, though I think what we would need here could be more simple and narrow in scope since we don't need to deal with the OCI or filesystem pre-prep.

TaylorMutch requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 29, 2026 19:50

TaylorMutch added the test:e2e Requires end-to-end coverage label May 29, 2026

fix(sandbox): refresh docker/podman/vm tokens in gateway

7afe767

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

TaylorMutch force-pushed the 1603-gateway-managed-token-refresh/tm branch from df3f0bc to 7afe767 Compare June 1, 2026 16:08

fix(server): use sandbox phase accessor after rebase

2d872b7

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sandbox): refresh docker/podman/vm tokens in gateway#1639

fix(sandbox): refresh docker/podman/vm tokens in gateway#1639
TaylorMutch wants to merge 2 commits into
mainfrom
1603-gateway-managed-token-refresh/tm

TaylorMutch commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

maxamillion commented Jun 2, 2026 •

edited

Loading

Uh oh!

TaylorMutch commented Jun 2, 2026

Uh oh!

maxamillion commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

TaylorMutch commented May 29, 2026

Summary

Related Issue

Changes

Testing

Checklist

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

maxamillion commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TaylorMutch commented Jun 2, 2026

Uh oh!

maxamillion commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

maxamillion commented Jun 2, 2026 •

edited

Loading